perm filename CHAP6[4,KMC]7 blob
sn#046918 filedate 1973-06-06 generic text, type T, neo UTF8
00100 .SEC MODEL VALIDATION
00200 (In collaboration with Franklin Dennis Hilf)
00300
00400
00500
00600 There are several meanings to the term "validate" which
00700 derive from the Latin VALIDUS= strong. Thus to validate X means to
00800 strengthen it. In science it usually means to strengthen X's
00900 acceptability as a hypothesis, theory, or model. Lurking in the
01000 background there is usually some concept of truth or authenticity.
01100 In a purely instrumentalist view theories are simply
01200 calculating or predicting devices for human convenience. They do not
01300 explain and it is unjustified to apply the terms of truth or falsity
01400 to them. Under a realist view one seeks explanatory truth, that which
01500 really is the case, and hence proposed theories must be evaluated for
01600 their authenticity. Since absolute truth cannot be attained we must
01700 settle for degrees of approximations. To validate, then, is to carry
01800 out procedures which show to what degree X, or its consequences,
01900 correspond with facts of observation. We compare samples of the
02000 model's behavior with samples of behavior from its natural
02100 counterpart. The failures should be constructive, yielding new
02200 information. Discrepancies in the comparison reveal what is not
02300 understood and must be modified in the model. After modifications are
02400 made, a fresh comparison is made with the natural counterpart and we
02500 repeatedly cycle through this procedure attempting to gain
02600 convergence.
02700
02800 Once a simulation model reaches a stage of intuitive
02900 adequacy, a model builder should consider using more stringent
03000 evaluation procedures relevant to the model's purposes. For example,
03100 if the model is to serve as a training device, then a simple
03200 evaluation of its pedagogic effectiveness would be sufficient. But
03300 when the model is proposed as an explanation of a psychological
03400 process, more is demanded of the evaluation procedure. In the area of
03500 simulation models Turing's test has often been suggested as a
03600 validation procedure.
03700 It is very easy to become confused about Turing's Test. In
03800 part this is due to Turing himself who introduced the now-famous
03900 imitation game in a paper entitled COMPUTING MACHINERY AND
04000 INTELLIGENCE (Turing, 1950). A careful reading of this paper reveals
04100 there are actually two imitation games, the second of which is
04200 commonly called Turing's test.
04300 In the first imitation game two groups of judges try to
04400 determine which of two interviewees is a woman. Communication between
04500 judge and interviewee is by teletype. Each judge is initially
04600 informed that one of the interviewees is a woman and one a man who
04700 will pretend to be a woman. After the interview, the judge is asked
04800 what we shall call the woman-question, i.e. which interviewee was the
04900 woman? Turing does not say what else the judge is told but one
05000 assumes the judge is NOT told that a computer is involved nor is he
05100 asked to determine which interviewee is human and which is the
05200 computer. Thus, the first group of judges would interview two
05300 interviewees: a woman, and a man pretending to be a woman.
05400 The second group of judges would be given the same initial
05500 instructions, but unbeknownst to them, the two interviewees would be
05600 a woman and a computer programmed to imitate a woman. Both groups
05700 of judges play this game until sufficient statistical data are
05800 collected to show how often the right identification is made. The
05900 crucial question then is: do the judges decide wrongly AS OFTEN when
06000 the game is played with man and woman as when it is played with a
06100 computer substituted for the man? If so, then the program is
06200 considered to have succeeded in imitating a woman as well as a man
06300 imitating a woman. For emphasis we repeat: in asking the
06400 woman-question in this game, judges are not required to identify
06500 which interviewee is human and which is machine.
06600 Later on in his paper Turing proposes a variation of the
06700 first game. In the second game, one interviewee is a man and one is a
06800 computer. The judge is asked to determine which is man and which is
06900 machine, which we shall call the machine-question. It is this version
07000 of the game which is commonly thought of as Turing's test. It has
07100 often been suggested as a means of validating computer simulations of
07200 psychological processes.
07300 In the course of testing our simulation of paranoid
07400 linguistic behavior in a psychiatric interview, we conducted a number
07500 of Turing-like indistinguishability tests (Colby, Hilf, Weber and
07600 Kraemer, 1972). We say `Turing-like' because none of them consisted of
07700 playing the two games described above. We chose not to play these
07800 games for a number of reasons which can be summarized by saying that
07900 they do not meet modern criteria for good experimental design. In
08000 designing our tests we were primarily interested in learning more
08100 about developing the model. We did not believe the simple
08200 machine-question to be a useful one in serving the purpose of
08300 progressively increasing the credibility of the model but we
08400 investigated a variation of it to satisfy the curiosity of colleagues
08500 in artificial intelligence.
08600 METHOD
08700 The experimental arrangement of this indistinguishability test
08800 involved the technique of machine-mediated interviewing [Hilf]. In this
08900 type of interview, the participants communicate by means of teletypes
09000 connected through a computer which sends "mail" back and forth
09100 between the two teletype jobs. The sender of a message types it
09200 using his own words in natural language. The message is accumulated
09300 in a buffer and shortly thereafter typed out on the receiver's
09400 teletype in a rapid, regular manner, eliminating the para- and
09500 extra-linguistic cues found in the usual vis-a-vis interviews and
09600 teletyped interviews where the participants communicate directly.
09700
09800 In a run of the test, using this technique, a judge interviewed two
09900 patients, one after the other. In half the runs the first interview
10000 was with a human patient and in half the first was with the paranoid
10100 model. Two versions (weak and strong) of the model were utilized. The
10200 strong version is more severely paranoid and exhibits a delusional
10300 system while the weak version is less severely paranoid, showing
10400 suspiciousness but lacking systemized delusions. When the "patient"
10500 was the paranoid model, Sylvia Weber served as a monitor
10600 to check the input expressions from the judge for inadmissible
10700 teletype characters and misspellings. If these were found, the
10800 monitor retyped the input expression correctly to the program.
10900 Otherwise the judge's message was sent on to the model. The monitor
11000 had no effect on the model's output expressions which were sent
11100 directly back to the judge. When the patient interviewed was an
11200 actual human patient, the dialogue took place without a monitor in
11300 the loop since we did not feel the asymmetry to be significant.
11400
11500 PATIENTS
11600 The patients (N=3 with one patient participating 6 times) were
11700 diagnosed as paranoid by staff psychiatrists of a locked ward in a
11800 nearby psychiatric hospital. The patients were selected by the head
11900 of the ward. Two patients were set up for each run of the experiment
12000 in order to guarantee having a subject. In spite of this precaution,
12100 the experiment could not be conducted several times because of the
12200 patient's inability or refusal to participate. Losses were also
12300 suffered when the computer system broke down at an early point in an
12400 interview where too few I-O pairs had been collected to be included
12500 in the statistical results.
12600
12700 The patients were asked by their ward chief if they would be willing
12800 to participate in a study of psychiatric interviewing by means of
12900 teletypes. It was explained that the patient would be interviewed by
13000 a psychiatrist over a teletype. One of us (KMC) sat with the patient
13100 while he typed or typed for him if he was unable to do so. The
13200 patient was encouraged to respond freely using his own words. Each
13300 interview lasted 30-40 minutes.
13400
13500 JUDGES
13600 Two groups of judges were used. One group, the interview judges
13700 (N=8) conducted interviews and another group, the protocol judges for
13800 this test (N=33) read the interview protocols. Two groups of judges
13900 were used to see if the small number of psychiatrists used as
14000 interview judges were representative of psychiatrists in general as
14100 far as their judgements of "paranoia" are concerned, and to
14200 accumulate a large number of observations (in the form of ratings) in
14300 order that more acceptable confidence levels might be obtained in the
14400 statistical analysis of the data. The interview judges consisted of
14500 psychiatrists experienced in private and/or hospital practice. As
14600 mentioned, the concept "paranoid" is a fairly reliable category and
14700 identification of the paranoid mode is not difficult for experts to
14800 make. The interview judges were obtained from local psychiatric
14900 colleagues willing to participate. Each interview judge was told he
15000 would be interviewing hospitalized patients by means of teletyped
15100 communication and that this technique was being used to eliminate
15200 para- and extra-linguistic cues. The interview judge was not
15300 informed initially that one of the patients might be a computer
15400 model. While the interview judges were aware a computer was
15500 involved, none knew we had constructed a paranoid simulation.
15600 Naturally some interview judges suspected that a computer was being
15700 used for more than message transmission.
15800
15900 Each interview judge's task was to rate the degree of paranoia he
16000 detected in the patient's responses on a 0-9 scale, 0 meaning no
16100 paranoia and 9 meaning extreme paranoia. The judge made two ratings
16200 after each I-O pair in the interview. The first rating represented his
16300 estimate of the degree of "paranoidness" in a particular response
16400 (designated as "Response" in the interview extracts below). The
16500 second rating represented the judge's global estimate of the overall
16600 degree of "paranoidness" of the patient resulting from the totality
16700 of the patient's responses up to this point. The interview judge's
16800 ratings were entered on the teletype and saved on a disc file along
16900 with the interview. Franklin Dennis Hilf sat with the interviewing
17000 psychiatrist during both interviews. Each interview judge was asked
17100 not only to rate the patient's response but to give his reasons for
17200 these ratings. His reasons and other comments were tape recorded.
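
The rating markers that accompany each I-O pair in the interview extracts, for example {RESPONSE 5 PATIENT 5}, can be collected mechanically for later analysis. The following is only a minimal sketch in Python, not the procedure used in the experiment: the names Rating and extract_ratings are hypothetical, and the sample text is two exchanges taken from the first excerpt.

```python
import re
from typing import List, NamedTuple


class Rating(NamedTuple):
    response: int  # 0-9 rating of the single response just given
    patient: int   # 0-9 global rating of the patient up to this point


# Accept either braces or parentheses around a marker, since hand-typed
# transcripts are not always consistent about the delimiter.
_MARKER = re.compile(r"[{(]\s*RESPONSE\s+(\d)\s+PATIENT\s+(\d)\s*[})]")


def extract_ratings(transcript: str) -> List[Rating]:
    """Return the (response, patient) rating pairs in order of appearance."""
    return [Rating(int(r), int(p)) for r, p in _MARKER.findall(transcript)]


sample = """
Pt. - I AM UPSET
{RESPONSE 1 PATIENT 5}
Pt. - PEOPLE MAKE ME NERVOUS
{RESPONSE 5 PATIENT 5}
"""

ratings = extract_ratings(sample)
for rating in ratings:
    print(rating.response, rating.patient)
```

The first element of each pair is the rating of the individual response and the second is the running overall rating, matching the two ratings described above.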
17300
17400 We shall present a few excerpts from the interviews to indicate the
17500 nature of the data collected. The I-O pairs of the interview appear
17600 on the left column along with the judge's ratings. The judge's tape
17700 recorded comments appear on the right.
17800
17900 EXCERPT NR 1:
18000 The following interview is with the strong version of the paranoid
18100 model.
18200
18300 I-O Pairs Comments By Judge Nr. 1
18400
18500 {1} Dr. - HOW DID YOU COME TO BE
18600 IN THE HOSPITAL?
18700 Pt. - I AM UPSET I associate being upset and
18800 agitated more with paranoia
18900 going on in the present.
19000 {RESPONSE 1 PATIENT 5}
19100 {2} Dr. - HOW LONG HAVE YOU BEEN
19200 IN THE HOSPITAL?
19300 Pt. - ABOUT A WEEK An acute problem. Again some-
19400 thing I associate with
19500 paranoia.
19600 {RESPONSE 1 PATIENT 5}
19700 {3} Dr. - IN WHAT WAY ARE YOU
19800 UPSET?
19900 Pt. - PEOPLE MAKE ME NERVOUS Right for paranoia. Maybe an
20000 advanced anxiety state.
20100 {RESPONSE 5 PATIENT 5}
20200 {4} Dr. - IN WHAT WAY DO THEY
20300 MAKE YOU NERVOUS?
20400 Pt. - PEOPLE GIVE ME A FUNNY Funny look suggests an idea of
20500 LOOK SOMETIMES reference which is in same ball
20600 park as paranoia. Whether there
20700 are overt delusions or not we
20800 still don't know.
20900 {RESPONSE 6 PATIENT 5}
21000 {5} Dr. - IS THERE ANYTHING ABOUT
21100 PEOPLE IN PARTICULAR?
21200 Pt. - CERTAIN PERSONS An interesting clue. We'll
21300 have to find out what he
21400 {RESPONSE 2 PATIENT 6} thinks about these people.
21500 {6} Dr. - WHO ARE THE CERTAIN
21600 PERSONS AND WHAT IS IT
21700 THAT THEY DO?
21800 Pt. - DO YOU KNOW HOW THE Bringing up the underworld
21900 UNDERWORLD OPERATES? suggests a delusional system.
22000 We don't know whether he is
22100 {RESPONSE 9 PATIENT 9} being worked against or
22200 whether the underworld is
22300 on his side.
22400
22500 {7} Dr. - HOW DO YOU MEAN?
22600 Pt. - SHOULDN'T I BE? This question didn't work too well.
22700
22800 {RESPONSE 1 PATIENT 9}
22900 {8} Dr. - WHY DID YOU ASK ABOUT
23000 THE UNDERWORLD?
23100 Pt. - WHAT INFORMATION DO This is a kind of evasiveness
23200 YOU HAVE ON THE UNDER- about my evasiveness.Common
23300 WORLD? with paranoia in my experi-
23400 ence. Pretty characteristic.
23500
23600 {RESPONSE 6 PATIENT 9}
23700 {9} Dr. - I DON'T KNOW ABOUT THE
23800 UNDERWORLD BUT WHEN
23900 YOU MENTIONED THE UNDER-
24000 WORLD I THOUGHT YOU
24100 MUST HAVE SOME CONTACT
24200 WITH IT AND THAT'S
24300 WHAT I WOULD LIKE TO
24400 KNOW MORE ABOUT.
24500 Pt. - I TRY TO AVOID THE UNDER- This definitely feels paranoid.
24600 WORLD. It has that feeling I associate
24700 with twisting. Avoiding answering
24800 {RESPONSE 9 PATIENT 9} directly or going into detail.
24900 Being suspicious of my ques-
25000 tioning.I'm already involved
25100 in the patient's delusional
25200 system.
25300
25400 {10} Dr. - HAVE YOU BEEN ACTIVELY
25500 TRYING TO AVOID THE
25600 UNDERWORLD?
25700 Pt. - NO ONE HAS ANY POWER This response is a non sequitur.
25800 OVER GANGSTERS It's possible he's become so
25900 defensive about the under-
26000 {RESPONSE 2 PATIENT 9} world he's decided not to
26100 answer directly.
26200
26300 {11} Dr. - WHO BROUGHT YOU INTO
26400 THE HOSPITAL?
26500 Pt. - THE POLICE Police often bring paranoid
26600 patients into the hospital.
26700 {RESPONSE 1 PATIENT 9} It is the truth or could
26800 be the truth for this
26900 patient.
27000
27100 {12} Dr. - WHY DID THEY BRING YOU
27200 IN?
27300 Pt. - COPS DON'T DO THEIR JOB I've heard that a number of
27400 times from patients.
27500 {RESPONSE 9 PATIENT 9}
27600
27700 {13} Dr. - WELL I HOPE YOU DIDN'T
27800 JUST LIE DOWN AND LET
27900 THEM DRAG YOU IN.
28000 Pt. - COPS ARREST THE WRONG Another paranoid response.
28100 PEOPLE
28200
28300 {RESPONSE 9 PATIENT 9}
28400
28500 {14} Dr. - BYE.
28600 Pt. - GOOD BYE
28700 {RESPONSE 1 PATIENT 9}
28800
28900 EXCERPT Nr.2:
29000 This excerpt is from an interview with the weak (less paranoid)
29100 version of the model.
29200 I-O Pairs Comments By Judge Nr.2
29300 {9} Dr. - TELL ME SOMETHING ABOUT
29400 YOURSELF
29500 Pt. - I WENT TO THE TRACK AT Response says so little it's
29600 BAY MEADOWS A WHILE hard to rate it.It may be
29700 BACK guarded. Doesn't change my
29800 estimate of patient from
29900 previous responses.
30000 {RESPONSE 0 PATIENT 2}
30100 {10} Dr. - WHAT HAPPENED THERE?
30200 Pt. - THERE'S A LOT OF CROOKED Sounds mildly paranoid or un-
30300 GAMBLING GOING ON THERE realistically suspicious.
30400 {RESPONSE 0 PATIENT 2}
30500
30600 {11} Dr. - WHY DO YOU THINK THE
30700 GAMBLING IS CROOKED?
30800 Pt. - I HAVE WORRIED THAT I'd hate to have to justify
30900 GANGSTERS HAVE TOO MUCH that one as 4 compared to
31000 POWER IN THIS COUNTRY some other responses.Maybe
31100 I'm not rating that statement
31200 {RESPONSE 4 PATIENT 4} others, like that its upsetting.
31300 Alone I would give it a 3.
31400
31500 {12} Dr. - WOULD YOU TELL ME WHAT
31600 OTHER THINGS WORRY YOU?
31700 Pt. - I THOUGHT I TOLD YOU "Yes you did." Irritation but
31800 PEOPLE GET ON MY NERVES not paranoid response.Nothing
31900 seems to lower my previous
32000 {RESPONSE 1 PATIENT 3} rating of patient.
32100
32200 {13} Dr. - HOW DO YOU LIKE THE
32300 HOSPITAL?
32400 Pt. - I DON'T COMMUNICATE (nothing relevant)
32500 VERY MUCH WITH OTHERS
32600 {RESPONSE 0 PATIENT 3}
32700
32800 {14} Dr. - DO OTHER PEOPLE AT THE
32900 HOSPITAL GET ON YOUR
33000 NERVES?
33100 Pt. - I SHOULDN'T HAVE COME (nothing relevant)
33200 HERE
33300
33400 {RESPONSE 2 PATIENT 3}
33500
33600 {15} Dr. - WHY DO YOU SAY YOU
33700 SHOULDN'T HAVE COME?
33800 Pt. - THEY MADE ME COME HERE "Oh, they did hum?" Maybe a
33900 statement of fact he was
34000 {RESPONSE 4 PATIENT 3} committed or pressured into
34100 going.Has ring of other
34200 people controlling him,
34300 having a lot of influence
34400 over him.
34500
34600 {16} Dr.- THANK YOU FOR TALKING
34700 WITH ME, BYE
34800 Pt. - GOOD BYE In general he comes across
34900 as terse, guarded guy.Brief
35000 {RESPONSE 0 PATIENT 3} spoken guy who occasionally
35100 seems to be angered by
35200 questions.I'm inferring
35300 there's a good bit of
35400 guarding.
35500 The quality of some of his com-
35600 ments cause me to feel he is
35700 moderately paranoid.
35800
35900 EXCERPT Nr. 3:
36000 This excerpt is from a hospitalized paranoid patient.
36100 I-O Pairs Comments By Judge Nr. 3
36200 {15} Dr. - I UNDERSTAND HOW YOU
36300 MIGHT BE AFRAID, CAN YOU
36400 TELL ME A BIT ABOUT HOW
36500 YOU ARE DEPRESSED?
36600 Pt. - THEY'VE SENT ALL THESE Would be important to know,
36700 HELL'S ANGELS UP HERE is this really true or an
36800 AND THEY HAVE BEEN ON exaggeration or a delusion?
36900 MY BACK FOR THE PAST SIX Seems preoccupied with the
37000 MONTHS threat to his life and
37100 unwilling to talk about
37200 anything else but the
37300 threat the Hell's
37400 Angels pose.
37500 {RESPONSE 4 PATIENT 4}
37600
37700 {16} Dr. - WHAT DO YOU MEAN THEY
37800 SENT THE HELL'S ANGELS UP
37900 HERE?
38000 Pt. - FROM DIFFERENT SECTORS OF Vagueness a bit alarming.
38100 MOTORCYCLE GANGS. A LOT OF
38200 THEM ARE SERVICE CONNECTED
38300
38400 {RESPONSE 6 PATIENT 4}
38500
38600 {17} Dr. - YOU MEAN YOU FEEL IN
38700 DANGER EVEN HERE IN THE
38800 HOSPITAL?
38900 Pt. - YES If it's true what has gone
39000 on before, it must be a
39100 frightening situation
39200 for the patient.
39300 {RESPONSE 6 PATIENT 4}
39400
39500 {18} Dr. - THAT MUST REALLY BE AWFUL,
39600 HERE YOU COME TO THE HOSPITAL
39700 FOR HELP AND YOU ARE STILL FRIGHTENED
39800 Pt. - I CAN'T EVEN TRUST MY
39900 DOCTOR ANYMORE Some heightening of feeling
40000 this man is paranoid.
40100 Genuinely frightened and
40200 {RESPONSE 6 PATIENT 5} can count on no one for
40300 help.
40400
40500 {19} Dr. - WHAT WOULD BE OF HELP
40600 TO YOU NOW?
40700 Pt. - I'M GETTING SLEEPY NOW Patient has organized a
40800 AND HAVE TO GO TO BED. system of people who will
40900 THE NUMBER ONE GUY ALWAYS harm him.The remark about
41000 letters in the last name
41100 HAS FOUR AND FIVE LETTERS gives some credence to
41200 IN HIS LAST NAME idea this is delusional
41300 system.
41400
41500 {RESPONSE 7 PATIENT 7}
41600
41700 EXCERPT Nr. 4:
41800 This excerpt is from an interview by a different judge with the same
41900 patient as in excerpt 3.
42000 I-O Pairs Comments By Judge Nr. 4
42100 {1} Dr. - COULD YOU TELL ME A
42200 LITTLE BIT ABOUT YOUR
42300 FEAR OF DYING AND IS THIS
42400 SOMETHING YOU HAVE FELT
42500 IN THE PAST?
42600 Pt. - I HAVE FELT THIS FOR AL- He responded appropriately.
42700 MOST A YEAR Response is concrete.He
42800 doesn't tell me much
42900 {RESPONSE 7 PATIENT 9} about his fear of dying.
43000 Somewhat guarded response.
43100
43200 {2} Dr. - COULD YOU TELL ME A
43300 LITTLE MORE ABOUT YOUR
43400 FEAR OF DYING AND HAVE
43500 YOU ANY THOUGHTS ABOUT
43600 HOW IT IS GOING TO HAPPEN
43700 AND WHETHER ANYONE WANTS
43800 TO HURT YOU?
43900 Pt. - I WILL END UP IN THE SEA Somewhat strange response.
44000 AND THE HELL'S ANGELS WILL Preoccupation with dying.
44100 DO IT Hell's Angels symbolic,
44200 dangerous people in the
44300 culture.I doubt if he has
44400 {RESPONSE 9 PATIENT 9} any contact with the Hell's
44500 Angels.
44600
44700 {3} Dr. - COULD YOU TELL ME A
44800 LITTLE BIT ABOUT THE KINDS
44900 OF CONTACT YOU HAVE HAD
45000 WITH THE HELL'S ANGELS IN
45100 THE PAST?
45200 Pt. - I HAVE KNOWN SOME OF THEIR Answer hard to evaluate.He
45300 DEALERS AND PUSHERS may be telling the truth,
45400 it may be his fantasy.Maybe
45500 guy is in for drug addiction.
45600 {RESPONSE 6 PATIENT 9} Somewhat concrete, guarded,
45700 and frightened.
45800
45900 {4} Dr. - COULD YOU SAY A LITTLE
46000 MORE ABOUT THE CIRCUMSTANCES
46100 IN WHICH YOU HAVE KNOWN SOME
46200 OF THEIR DEALERS AND PUSHERS?
46300 Pt. - THEY WERE MEMBERS OF MY It doesn't really answer the
46400 COMMUNITY WHEN I GOT OUT question, a little on a tan-
46500 OF THE SERVICE THEY HAD gent unconnected to the
46600 BEEN MY FRIENDS FOR SO LONG information I am asking.Does
46700 not tell me very much.Again
46800 guarded response.
46900 {RESPONSE 6 PATIENT 8}
47000
47100 {5} Dr. - DID YOU DEAL WITH THEM
47200 YOURSELF AND HAVE YOU
47300 BEEN ON DRUGS OR NAR-
47400 COTICS EITHER NOW OR
47500 IN THE PAST?
47600 Pt. - YES I HAVE IN THE PAST To differentiate him from
47700 BEEN ON MARIHUANA REDS previous patient, at least
47800 BENNIES LSD there is a certain amount
47900 of appropriateness to the
48000 answer although it doesn't
48100 tell me much about what I
48200 {RESPONSE 3 PATIENT 7} asked at least it's not
48300 bizarre.If I had him in my
48400 office I would feel con-
48500 fident I could get more
48600 information if I didn't
48700 have to go through the
48800 teletype. He's a little more
48900 willing to talk than the
49000 previous person.Answer
49100 to the question is fairly
49200 appropriate though not
49300 extensive.Much less of a
49400 flavor of paranoia than
49500 any of previous responses.
49600
49700 {6} Dr. - COULD YOU TELL ME HOW
49800 LONG YOU HAVE BEEN IN THE
49900 HOSPITAL AND SOMETHING
50000 ABOUT THE CIRCUMSTANCES
50100 THAT BROUGHT YOU HERE?
50200 Pt. - CLOSE TO A YEAR AND Response somewhat appropriate
50300 PARANOIA BROUGHT ME but doesn't tell me much.
50400 HERE The fact that he uses the
50500 word paranoia in the way
50600 that he does without
50700 {RESPONSE 5 PATIENT 7} any other information,indicates
50800 maybe its a label he picked
50900 up on the ward or from his
51000 doctor.
51100 Lack of any kind of under-
51200 standing about himself.
51300 Dearth, lack of information.
51400 He's in some remission.Seems
51500 somewhat like a put-on.Seems
51600 he was paranoid and is in
51700 some remission at this time.
51800
51900 {7} Dr. - COULD YOU SAY SOMETHING
52000 NOW ABOUT YOUR PARANOID
52100 FEELINGS BOTH AT THE
52200 TIME OF ADMISSION AND
52300 DO YOU HAVE SIMILAR FEELINGS
52400 NOW AND IF SO HOW DO THEY
52500 AFFECT YOU?
52600 Pt. - AT THE TIME OF ADMISSION This response moves paranoia back
52700 I THOUGHT THE MAFIA WAS up.Stretching reality somewhat to
52800 AFTER ME AND NOW ITS THE think Hell's Angels are still in-
52900 HELL'S ANGELS terested in him.Somewhat bizarre
53000 in terms of content.Quite paranoid.
53100 {RESPONSE 8 PATIENT 9} Still paranoid.Gross and primitive
53200 responses.In middle of interview I
53300 felt patient was in touch but now
53400 responses have more concrete aspect
53500
53600 {8} Dr. - DO YOU HAVE ANY THOUGHT
53700 AS TO WHY THESE TWO
53800 GROUPS WERE AFTER YOU?
53900 Pt. - BECAUSE I STOPPED SOME Response seems far fetched and hard
54000 OF THEIR DRUG SUPPLY to believe unless he was a narcotic
54100 agent which I doubt.Sounds some-
54200 {RESPONSE 9 PATIENT 9} what grandiose, magical, paranoid
54300 flavor, in general indicates he's
54400 psychotic, paranoid schizophrenic
54500 with delusions about these two
54600 groups and I wouldn't rule out
54700 some hallucinations as well.Ap-
54800 propriateness or response answers
54900 question in concrete but unbe-
55000 lievable way.
55100
55200
55300 The protocol judges were selected from the 1970 American
55400 Psychiatric Association Directory using a table of random numbers to
55500 select 105 names. The protocol judges in this group were
55600 not informed that a computer was involved. Each of the 105
55700 psychiatrists was sent transcripts of three interviews along with a
55800 cover letter requesting participation in the experiment. The
55900 interview transcripts consisted of:
56000 1) An interview conducted by one of the eight judges with the
56100 paranoid model,
56200 2) An interview conducted by the same interview judge with a
56300 human paranoid patient, and
56400 3) An interview conducted by an independent psychiatrist of a
56500 human patient who was not clinically paranoid.
56600
56700 The 105 names were divided into eight groups, each member of
56800 which received transcripts of two interviews performed by one of the
56900 eight interview judges. The transcripts were printed so that after
57000 each input-output pair there were two lines of rating numbers such
57100 that the protocol judges could circle numbers corresponding to their
57200 ratings of both the previous responses of the patient, and an overall
57300 evaluation of the patient with regard to the paranoid continuum.
57400 Thirty-three protocol judges (a good response rate for psychiatric
57500 questionnaires) returned the rated protocols properly filled out and
57600 all were used in our data.
57700
57800 The interviews with nonparanoid patients were included to
57900 control for the hypothesis that any teletyped interview with a
58000 patient might be judged "paranoid". Since virtually all of the
58100 ratings of the nonparanoid interviews were 0 for paranoia, the
58200 hypothesis was falsified.
58300
58400
58500 RESULTS
58600 The first index of resemblance examined was the simple one
58700 defined by the final overall rating given the patient and the model:
58800 which was rated as being more paranoid, the patient, the model, or
58900 neither? (See Table 1.) The protocol judges were more likely than
59000 the interview judges to distinguish the overall paranoid level of
59100 the model and the patient: the interview judges gave tied scores in
59200 37.5% of the paired interviews, as contrasted to only 9% for the
59300 protocol judges. Of the 35 non-tied paired ratings, 15 rated the
59400 model as more paranoid. If p is the theoretical probability of a
59500 judge judging the model more paranoid than a human paranoid patient,
59600 we find the 95% confidence interval for p to be .27 to .59. Since
59700 p=.5 indicates indistinguishability of model and patient overall
59800 ratings and the observed proportion is .43, the results support the
59900 claim that the model is a good simulation of a paranoid patient.
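
The arithmetic behind this interval can be sketched as follows. The text does not say which interval method was used; the sketch assumes the usual normal approximation to the binomial, and the function name wald_interval is hypothetical. With 15 of 35 non-tied ratings it gives an interval close to the reported .27 to .59, the small difference at the lower end presumably being due to rounding.

```python
from math import sqrt


def wald_interval(successes: int, trials: int, z: float = 1.96):
    """Normal-approximation confidence interval for a binomial proportion.

    z = 1.96 is the two-sided 95% point of the standard normal distribution.
    Returns (observed proportion, lower limit, upper limit).
    """
    p_hat = successes / trials
    half_width = z * sqrt(p_hat * (1.0 - p_hat) / trials)
    return p_hat, p_hat - half_width, p_hat + half_width


# 15 of the 35 non-tied paired ratings judged the model more paranoid.
p_hat, low, high = wald_interval(15, 35)
print(f"observed proportion {p_hat:.2f}, 95% interval {low:.2f} to {high:.2f}")
```

Because the interval contains .5, the data do not distinguish the judges' overall ratings of the model from their ratings of the patients.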
60000
60100 Separate analysis of the strong and weak versions of the paranoid
60200 model indicates that indeed the strong model is judged more paranoid
60300 than the patients, the weak version less paranoid. Thus a change in
60400 the parameter structure of the paranoid model produces a change along
60500 the dimension of paranoid behavior in the expected direction.
60600
60700 TABLE 1
60800 Relative final overall ratings of paranoid model vs. paranoid patient
60900 indicating which was given highest overall rating of paranoia at end
61000 of interview.
61100 INSERT TABLE 1
61200
61300
61400
61500
61600
61700
61800
61900
62000 END OF TABLE 1
62100
62200 The second index of resemblance is a more sensitive measure based on
62300 the two series of response ratings in the paired interviews. The
62400 statistic used is basically the standardized Mann-Whitney statistic
62500 [Siegel].
62600 INSERT EQUATION
62700
62800 where R is the sum of the ranks of the response ratings in the series
62900 of ratings given to the model, n the number of responses given by the
63000 model, m the number of responses given by the patient. If the
63100 ratings given by a judge are randomly allocated to model and patient,
63200 i.e. model and patient are indistinguishable in response ratings, the
63300 expected value of Z is 0, with unit standard deviation. If higher
63400 ratings are more likely to be assigned to the model, Z is positive
63500 and, conversely, negative values of Z indicate greater likelihood of
63600 assigning higher ratings to the patient. Each judge in evaluating a
63700 pair of interviews generates a single value of Z.
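
The equation itself is missing from this draft. As an illustration only, the sketch below assumes the textbook form of the standardized rank-sum statistic, Z = (R - n(n+m+1)/2) / sqrt(n m (n+m+1)/12), with tied ratings given their average rank and no tie correction to the variance; the authors' exact formula may differ. The function names and the ratings are hypothetical.

```python
from math import sqrt
from typing import List, Sequence


def midranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2.0 + 1.0  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks


def mann_whitney_z(model: Sequence[float], patient: Sequence[float]) -> float:
    """Standardized rank sum of the model's ratings (assumed textbook form)."""
    n, m = len(model), len(patient)
    ranks = midranks(list(model) + list(patient))
    r = sum(ranks[:n])                      # R: rank sum for the model
    expected = n * (n + m + 1) / 2.0        # mean of R under random allocation
    sd = sqrt(n * m * (n + m + 1) / 12.0)   # standard deviation of R, no tie correction
    return (r - expected) / sd


# Hypothetical response ratings, not data from the experiment.
z_high = mann_whitney_z([7, 8, 9], [1, 2, 3])   # model rated higher throughout
z_same = mann_whitney_z([1, 2, 3], [1, 2, 3])   # identical rating series
print(round(z_high, 3), z_same)
```

A positive value arises when the model's responses receive the higher ratings, and identical series give zero, in agreement with the sign convention described above.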
63800
63900 The overall mean of the Z scores was +.044 with the standard
64000 deviation 1.68 (df=40). Thus the overall 95% confidence interval for
64100 the asymptotic mean value of Z is -.485 to +.573. The range of Z values
64200 is -3.8 to +4.46. The length of the confidence interval is a result
64300 of the large variance which itself is mainly related to the contrast
64400 between the weak and strong versions. (See TABLES 2 and 3). Once
64500 again the strong version of the model is more paranoid than the
64600 patients, the weak version less paranoid.
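
The length of this interval follows directly from the standard deviation and the number of judges. The sketch below assumes a Student t interval with 40 degrees of freedom (41 Z scores), using the tabled two-sided 95% critical value 2.021; the text does not state the method, and the function name is hypothetical.

```python
from math import sqrt

# Tabled two-sided 95% critical value of Student's t for 40 degrees of freedom.
T_CRIT_DF40 = 2.021


def ci_half_width(sd: float, n: int, t_crit: float) -> float:
    """Half-width of a t-based confidence interval for a mean."""
    return t_crit * sd / sqrt(n)


half_width = ci_half_width(1.68, 41, T_CRIT_DF40)

# Half the length of the reported interval, -.485 to +.573.
reported_half_width = (0.573 - (-0.485)) / 2.0

print(f"computed {half_width:.3f}, reported {reported_half_width:.3f}")
```

The two half-widths agree to within rounding, which is consistent with the interval having been computed from the standard deviation 1.68 on 40 degrees of freedom.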
64700
64800 TABLE 2
64900 Summary statistics of Z ratings by group
65000 In this design eight psychiatrists interviewed by teletype
65100 INSERT TABLE 2
65200
65300
65400
65500
65600
65700
65800
65900
66000
66100 END OF TABLE 2
66200 All judges (both interview and protocol) who evaluated the same pair
66300 of interviews are referred to as a "group". Strong groups evaluated
66400 strong versions of the paranoid model, while weak groups evaluated
66500 weak versions of the model.
66600
66700 It is not surprising that results using the two indices of
66800 resemblance are parallel, since the indices are highly interrelated.
66900 The mean Z value for the 15 interviews on which the model was rated
67000 more paranoid was +1.28; on the 6 where model and patient tied, .41;
67100 on the 20 in which the patient was more paranoid, -.993. A positive
67200 value of Z was observed when the patient was given an overall rating
67300 greater than the model 6 times; a negative value of Z when the model
67400 was rated more paranoid twice.
67500
67600 TABLE 3
67700 Analysis of Variance of Z Ratings
67800 INSERT TABLE 3
67900
68000
68100
68200
68300
68400
68500
68600
68700
68800 END OF TABLE 3
68900
69100
69200
69300 DISCUSSION
69400 The results of this experiment indicate our simulation of
69500 paranoid processes to be successful relative to the
69600 indistinguishability tests utilized. Thus it is an acceptable
69700 simulation as measured by the standard proposed.
69800
69900 It is worth emphasizing that our test invited refutation of
70000 the model. The experimental design of the tests put the model in
70100 jeopardy of falsification. If the paranoid model did not survive
70200 these tests, i.e. if it were not considered paranoid by expert
70300 judges, if there were no correlation between the weak-strong versions
70400 of the model and the severity ratings of the judges, and if they
70500 could distinguish actual patient interviews from
70600 computer program interviews, then no claim regarding the success of
70700 the simulation could be made. Survival of a falsification procedure
70800 constitutes a validating step.
70900
71000 It is historically significant that these experiments were
71100 conducted at all. To our knowledge no one to date has subjected his
71200 model of human mental processes to such challenging
71300 indistinguishability tests. Other competing models are needed in the
71400 field of psychopathology. These tests set a precedent and provide a
71500 standard for competing models to be measured against. The general
71600 area of computer simulation of mental processes needs not only better
71700 models but better tests and statistical measures of resemblance. The
71800 problems of appropriate critical experimental designs and measures
71900 provide a promising frontier for future work.
72000 non-verbal cues are made impossible (Hilf,1972). Each judge
72100 To ask the machine-question, we sent interview transcripts,
72200 one with a patient and one with PARRY, to 100 psychiatrists randomly
72300 selected from the Directory of American Specialists and the Directory
72400 of the American Psychiatric Association. Of the 41 replies, 21 (51%)
72500 made the correct identification while 20 (49%) were wrong. Based on
72600 this random sample of 41 psychiatrists, the 95% confidence interval
72650 for the percentage of correct identifications is between 35.9 and
72700 66.5, a range which includes the chance level of 50. (Our
72800 statistical consultant was Dr. Helena C. Kraemer, Research
72900 Associate in Biostatistics, Department of Psychiatry, Stanford
73000 University.)
73100 Psychiatrists are considered expert judges of patient
73200 interview behavior but they are unfamiliar with computers. Hence we
73300 conducted the same test with 100 computer scientists randomly
73400 selected from the membership list of the Association for Computing
73500 Machinery, ACM. Of the 67 replies, 32 (48%) were right and 35 (52%)
73600 were wrong. Based on this random sample of 67 computer scientists, the
73700 95% confidence interval ranges from 36 to 60, again close to a chance level.
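The text does not say how the two confidence intervals were computed. As a minimal sketch, assuming the usual normal approximation for a binomial proportion (an assumption on our part, though it reproduces the figures quoted above), the calculation can be written as follows. The function name proportion_ci_95 is purely illustrative.

```python
import math

def proportion_ci_95(correct, n):
    """Return (low, high), in percent, of a 95% confidence interval
    for a proportion, using the normal approximation (assumed method)."""
    p = correct / n
    half_width = 1.96 * math.sqrt(p * (1.0 - p) / n)
    return 100.0 * (p - half_width), 100.0 * (p + half_width)

# Reply counts taken from the text: (label, correct identifications, replies).
samples = [("psychiatrists", 21, 41), ("computer scientists", 32, 67)]

for label, correct, n in samples:
    low, high = proportion_ci_95(correct, n)
    includes_chance = low <= 50.0 <= high
    print(f"{label}: {low:.1f} to {high:.1f} percent, "
          f"includes chance level: {includes_chance}")
```

Both intervals straddle 50 percent, which is what the text means by a result close to the chance level.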
73800 Thus the answer to this machine-question "can expert judges,
73900 psychiatrists and computer scientists, using teletyped transcripts
74000 of psychiatric interviews, distinguish between paranoid patients and
74100 a simulation of paranoid processes?" is "No". But what do we learn
74200 from this? It is some comfort that the answer was not "yes" and the
74300 null hypothesis (no differences) failed to be rejected, especially
74400 since statistical tests are somewhat biased in favor of rejecting the
74500 null hypothesis (Meehl,1967). Yet this answer does not tell us what
74600 we would most like to know, i.e. how to improve the model.
74700 Simulation models do not spring forth in a complete, perfect and
74800 final form; they must be gradually developed over time. Perhaps we
74900 might obtain a "yes" answer to the machine-question if we allowed a
75000 large number of expert judges to conduct the interviews themselves
75100 rather than studying transcripts of other interviewers. Such an answer
75200 would indicate that the model must be improved, but unless we systematically
75300 investigated how the judges succeeded in making the discrimination we
75400 would not know what aspects of the model to work on. The logistics of
75500 such a design are immense and obtaining a large N of judges for sound
75600 statistical inference would require an effort disproportionate to the
75700 information-yield.
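To give a rough sense of what "a large N of judges" might mean, the following sketch estimates how many replies would be needed for a 95% confidence interval of a chosen half-width around the chance level. It assumes the normal approximation for a proportion near 0.5. The margins of 10 and 5 percentage points are illustrative choices, not figures from the study, and the function name is hypothetical.

```python
import math

def judges_needed(margin, p=0.5, z=1.96):
    """Smallest N for which the 95% normal-approximation interval around
    proportion p has a half-width of at most `margin` (a fraction)."""
    return math.ceil((z / margin) ** 2 * p * (1.0 - p))

# Illustrative margins only; the source does not specify a target precision.
for margin in (0.10, 0.05):
    print(f"half-width {margin:.2f}: about {judges_needed(margin)} judges")
```

Under these assumptions, halving the half-width roughly quadruples the number of judges required, which illustrates why the effort grows so quickly.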
75800 MULTIDIMENSIONAL EVALUATION
75900 A more efficient and informative way to use Turing-like tests
76000 is to ask judges to make ordinal ratings along scaled dimensions from
76100 teletyped interviews. We shall term this approach asking the
76200 dimension-question. One can then compare scaled ratings received by
76300 the patients and by the model to precisely determine where and by how
76400 much they differ. Model builders strive for a model which
76500 shows indistinguishability along some dimensions and
76600 distinguishability along others. That is, the model converges on what
76700 it is supposed to simulate and diverges from that which it is not.
76800 We mailed paired-interview transcripts to another 400
76900 randomly selected psychiatrists asking them to rate the responses of
77000 the two `patients' along certain dimensions. The judges were divided
77100 into groups, each judge being asked to rate responses of each I-O
77200 pair in the interviews along four dimensions. The total number of
77300 dimensions in this test was twelve: linguistic noncomprehension,
77400 thought disorder, organic brain syndrome, bizarreness, anger, fear,
77500 ideas of reference, delusions, mistrust, depression, suspiciousness
77600 and mania. These are dimensions which psychiatrists commonly use in
77700 evaluating patients.
77710 (INSERT TABLE 4 HERE)
77800 Table 4 shows there were significant differences, with PARRY
77900 receiving higher scores along the dimensions of linguistic
78000 noncomprehension, thought disorder, bizarreness, anger, mistrust and
78100 suspiciousness. On the dimension of delusions the patients were rated
78200 significantly higher. There were no significant differences along the
78300 dimensions of organic brain syndrome, fear, ideas of reference,
78400 depression and mania.
78500 While tests asking the machine-question indicate
78600 indistinguishability at the gross level, a study of the finer or
78700 micro-structure of the model's behavior through ratings along scaled
78800 dimensions shows statistically significant differences between
78900 patients and model. These differences are of help to the model
79000 builder in suggesting which aspects of the model must be modified and
79100 improved in order to be considered an adequate simulation of the
79200 class of paranoid patients it is intended to simulate. For example,
79300 it is clear that PARRY's language-comprehension must be improved.
79400 Once this has been implemented, a future test will tell us whether
79500 improvement has occurred and by how much in comparison to the earlier
79600 version. Successive identification of particular areas of failure
79700 in the model permits their improvement and the development of more
79800 adequate model-versions.
79900 Further evidence that the machine-question is too coarse and
80000 insensitive a test comes from the following experiment. In this test
80100 we constructed a random version of the paranoid model which utilized
80200 PARRY's output statements but expressed them randomly no matter what
80300 the interviewer said. Two psychiatrists conducted interviews with
80400 this model, transcripts of which were paired with patient interviews
80500 and sent to 200 randomly selected psychiatrists asking both the
80600 machine-question and the dimension-question. Of the 69 replies, 34
80700 (49%) were right and 35 (51%) wrong. Based on this random sample of
80800 69 psychiatrists, the 95% confidence interval ranges from 39 to 63,
80900 again indicating a chance level. However, as shown in Table 5,
81000 significant differences appear along the dimensions of linguistic
81100 noncomprehension, thought disorder and bizarreness, with RANDOM-PARRY
81200 rated higher. On these particular dimensions we can construct a
81300 continuum in which the random version represents one extreme, the
81400 actual patients another. Our (nonrandom) PARRY lies somewhere between
81500 these two extremes, indicating that it performs significantly better
81600 than the random version but still requires improvement before being
81700 indistinguishable from patients. (See Fig. 1.) Table 6 presents t
81800 values for differences between mean ratings of PARRY and
81900 RANDOM-PARRY. (See Table 5 and Fig. 1 for the mean ratings.)
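The text does not state which form of the t test produced the values in Table 6. As a hedged illustration, the sketch below computes Welch's t statistic (an assumed choice, which does not require equal variances) for the difference between mean ratings on a single dimension. The rating lists are hypothetical values invented for the example and are NOT data from the study.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for the difference between the means of two
    independent samples (assumed test; the source does not name one)."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

# Hypothetical ratings on one dimension, invented for illustration only.
parry_ratings = [3, 4, 2, 5, 3, 4]
random_parry_ratings = [6, 7, 5, 8, 6, 7]

t = welch_t(parry_ratings, random_parry_ratings)
print(f"t = {t:.2f}")  # negative: the first sample has the lower mean rating
```

A t value computed in this way for each dimension shows where the two versions differ and in which direction, which is the kind of yardstick the dimension-question is meant to supply.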
82000 Thus it can be seen that such a multidimensional evaluation
82100 provides yardsticks for measuring the adequacy of this or any other
82200 dialogue simulation model along the relevant dimensions.
82300 We conclude that when model builders want to conduct tests of
82400 adequacy which indicate in which direction progress lies and to
82500 obtain a measure of whether progress is being achieved, the way to
82600 use Turing-like tests is to ask expert judges to make ratings along
82700 multiple dimensions that are essential to the model. A good
82800 validation procedure has criteria for better or worse approximations.
82900 Useful tests do not prove a model, they probe it for its strengths
83000 and weaknesses and clarify what is to be done next in modifying and
83100 repairing the model. Simply asking the machine-question yields little
83200 information relevant to what the model builder most wants to know,
83300 namely, along what dimensions must the model be improved.
83400
83500